ABSTRACT

This document examines the concepts of validity, reliability, 
and appropriateness from a language testing perspective as they apply to the 
following four assessment issues raised by the National Reporting System 
(NRS): (1) What type of language assessment seems to be required by the NRS:
proficiency or achievement?; (2) Knowing that the NRS focuses on what the
learner can do in the real world, and knowing the challenges to classroom 
teaching, what type of assessment would be most appropriate for the NRS?; (3) 
Knowing that all these variables need to be attended to, what does validity 
entail for appropriate NRS assessment?; and (4) What does reliability mean 
for performance measures meeting the rigorous requirements of the NRS? The 
document concludes that ensuring that language tests for adult English 
language learners are appropriate, valid, and reliable is challenging. 
Performance-based assessments are inherently complex to develop and
implement. Yet, because the focus of assessment (both in the NRS descriptors
and in the Department of Education's definition of content standards) is on
what learners can do with the language, performance assessments are worth 
developing and validating. (Adjunct ERIC Clearinghouse for ESL Literacy 
Education.) (Contains 16 references.) (SM)

Since 1998, federal guidelines have stated that assessment
procedures to fulfill the accountability requirements
of the Workforce Investment Act (WIA) must be valid,
reliable, and appropriate (U.S. Department of Education,
2001). This provision is likely to remain in the reauthorization
of the WIA in the coming year. The U.S. Department
of Education's blueprint for this legislation continues to 
call for local programs to demonstrate learner gains in 
reading, math, language arts, and English language acqui- 
sition. It also requires states to implement content stan- 
dards — defined as clear statements of what learners should 
know and be able to do — and to align assessments with 
these standards (U.S. Department of Education, 2003). The 
educational functioning levels, which form the basic frame- 
work and structure of the National Reporting System 
(NRS), remain. These levels may be refined (e.g., test 
benchmarks may be linked to real-life outcomes) (Keenan, 
2003). 

As the field of adult English as a second language (ESL) 
instruction moves towards content standards, program 
staff and state and national policy makers need to be able 
to make informed choices about appropriate assessments 
for adult English language learners. This Q&A examines 
the concepts of validity, reliability, and appropriateness 
from a language testing perspective as they apply to the 
following four assessment issues raised by the National 
Reporting System: 

1. What type of language assessment seems to be required
by the NRS: proficiency or achievement? 

2. What type of assessment would be most appropriate for 
the NRS? 

3. What does validity entail for appropriate NRS assess- 
ment? 

4. What does reliability mean for performance measures 
meeting the rigorous requirements of the NRS? 

What type of language assessment seems to be 
required by the NRS: proficiency or achievement? 

For adult English language learners in the United States, 
the basic reason for learning English is to use it. It is not to 
know about grammar or sophisticated details of English 
syntax or cultural aspects of the land where the language
is spoken. All of these have their place, but knowing a
language involves being able to put all of these pieces 
together in order to read for work or enjoyment, participate 
in conversations with others who speak English, or accom- 
plish other tasks using the language. 

Traditionally, achievement testing has been defined as 
assessing whether students have learned what they have 
been taught. Today, as the field of education institutes 
standards, assessment frameworks look not only at what 
students know about the language, but at what they can do 
with it. For adult language learners, that means using the 
language in everyday life. The goal of learning, then, is to
develop proficiency. The American Council on the Teaching
of Foreign Languages (ACTFL) defines language proficiency
as "language performance in terms of the ability to
use the language effectively and appropriately in real-life
situations" (Buck, Byrnes, & Thompson, 1989, p. 11).

Proficiency differs from achievement in that proficiency,
especially in language skills, is not necessarily confined
to what is taught in the classroom. Language acquisition
(learning of new vocabulary and structures) also occurs
outside the classroom as learners live, work, and interact
with others in an English-speaking environment (Gass, 1997).

The NRS defines six educational functioning levels for 
English language learners. These levels describe what 
learners can actually do. For example, learners at the 
Beginning ESL listening and speaking level 

♦ can understand frequently used words in context and 
very simple phrases spoken slowly with repetition, 

♦ can communicate basic survival needs with some help, 
and 

♦ can understand and participate effectively in face-to- 
face conversations on everyday subjects spoken at 
normal speed. 

These aims are focused on what happens in real life outside
the classroom. In language testing terms, the focus of the 
NRS is on proficiency. 

The challenge, both for teaching and assessment, is 
determining the relationship among content standards, 
curriculum, instruction, and proficiency (versus achievement)
outcomes. The field of foreign language education
has been struggling for several decades to align these.
Currently, only a few states have standards for adult ESL 
instruction. Among these are Arizona, California, Florida, 
Massachusetts, New York, and Washington (Florez, 2002). 
If content standards define what learners can do in the real 
world (proficiency), then how do these standards influence 
what happens in the classroom, particularly how profi- 
ciency is assessed? 

Adult learners come to the classroom with a variety of 
prior educational and life experiences. In acquiring En- 
glish literacy, learners require different curricula and in- 
structional strategies depending on whether they have ever 
acquired literacy in any language, have a high level of 
literacy in their own language, or are literate in a language 
that uses a Roman or a non-Roman alphabet (Burt, Peyton, 
& Adams, 2003). Learners also differ in their opportunities 
for language acquisition outside the classroom. For example,
they may work in jobs where contact with native
English speakers or speakers of other languages requires
them to use English, or they may work in jobs with very
little contact with other workers, particularly English speakers.
Some learners are able to attend class several times a
week and others only once. A couple of hours of instruction
a week is a very limited amount of time for developing
English language proficiency. What goes on inside the
classroom needs to help learners take advantage of what
goes on outside the classroom, so that learners can maximize
opportunities to increase their language acquisition
(Van Duzer et al., 2003).

Classroom assessments (e.g., reading, writing, or speaking
logs; checklists of communication tasks; and oral or
written reports; see Van Duzer & Berdan [2000] for a list and
discussion) can show how learners have mastered curricular
content or met their goals. The assessments may reflect
what the learners can do in the real world. However, 
without specific valid and reliable links to the NRS func- 
tioning levels, these tools and processes may not meet the 
current requirements to show level gain. 

Knowing that the NRS focuses on what the learner 
can do in the real world and knowing the chal- 
lenges to classroom teaching, what type of assess- 
ment would be most appropriate? 

A good language proficiency test is made up of language
tasks that replicate what goes on in the real world (Bachman
& Palmer, 1996). Performance assessments, which require
test takers to demonstrate their skills and knowledge
in a manner that closely resembles a real-life situation or
setting (National Research Council, 2002), seem appropriate.
A performance assessment generally has more
potential than a selected-response test (e.g., true-false or
multiple choice) to replicate language use in the real
world. That potential is realized, however, only if the
assessment itself is of high technical quality, not just
because it is a performance assessment.

Performance assessments are not easy to develop, ad- 
minister, score, and validate because there are many 
variables involved. The Performance-Based Assessment 
Variables model (see insert) illustrates the many variables 
that apply to the development of performance-based assessments
(adapted from Kenyon, 1992; McNamara, 1996;
and Skehan, 1998).

At the bottom of the model is the student (or examinee) whose
underlying competencies (knowledge, skills, and abilities 
[K/S/A]) are to be assessed. To do this, the student is given 
tasks to perform. 

Several variables surround these tasks. What is the 
quality of the task? Is it a good task or a poor task? Are 
conditions provided so that it can be successfully com- 
pleted? Will the learners be given enough time to do it? 

Next there is a test administrator who may interact with 
the examinee. The administrator may bring his or her own 
underlying competencies (knowledge, skills, and abilities) 
into the student's performance. Does the administrator 
know what to ask the student to do and how to ask it? 

These three elements (student, task, and administrator)
interact to produce a performance. However, that performance
needs to be assessed by a rater. Sometimes the rater
and the administrator are the same (e.g., in an oral inter- 
view) and other times the rater can be somebody different 
(e.g., in a writing assessment). Raters also bring additional 
variables. Are they well trained? Do they have the knowl- 
edge base needed to rate the performance? 

In order to assess the student's performance, raters need 
criteria, often contained in a scale or a rubric. The rubric 
needs to be useful and easy to interpret, and it must address
the aspects of the performance related to the examinee's 
underlying competencies that are to be assessed. For 
example, if writing is being assessed, do the rating criteria 
relate to characteristics of a good writer (e.g., ability to 
organize the writing, ability to use appropriate mechan- 
ics); if speaking is being assessed, do the criteria relate to 
competencies of a good speaker (e.g., ability to make 
oneself understood)? 

Finally, raters use the rubric or scale to assign a score to 
the performance. This score has meaning only insofar as
it is a valid and reliable measure of what the learner can do.
In other words, do the many variables depicted in the 
diagram work together to produce a score that is a valid 
indicator of an examinee's ability? Does the performance 
assessment allow the examinee to give a performance that 
reflects proficiency in the real world, can be adequately 
described and measured by the rubric, and can be scored
reliably? Can the assessment be repeated, both in terms of 
the performance being elicited and the score applied? 

Knowing that all these variables need to be at- 
tended to, what does validity entail for an appro- 
priate NRS assessment? 

Messick (1989) offers the following technical definition
of validity:

"Validity is an integrated evaluative judgment of
the degree to which empirical evidence and
theoretical rationale support the adequacy and
the appropriateness of inferences and actions
based on test scores or other modes of
assessments" (p. 11). (emphasis from the original)

This view adjusts the focus of validity from the test itself 
to include the use of the test scores. One has to ask if the 
test is valid for this use, in this context, for this purpose. 
With regard to the NRS, the main questions that need to
be answered seem to be the following: How well is the
performance elicited by the test aligned with the NRS 
descriptors? How well can the test assess yearly progress? 
How indicative of program quality are the performances 
on the assessment? 

Any assessment used for NRS purposes will be valid 
only if evidence can be provided that the inferences about 
the learners made on the basis of the test scores can be 
related to the NRS descriptors — i.e., what the learners can 
do (proficiency). The assessment must also be sensitive 
enough to learner gains to be able to show progress, if that 
is the use to which it is put. In addition, if the quality of 
programs is to be judged by performances on the assess- 
ment, then it must be demonstrated that there is a relation- 
ship between the two. 

Establishing validity for a particular use of a test is not a
single task or study. It is an accumulation of evidence
that supports the use of that test. It includes such
things as examining the relationship between performance
on the test and performance on similar assessments, examining
test performances vis-a-vis criteria inherent in the
NRS descriptors, and examining the reasonableness and
consequences of decisions made on the basis of test scores.
Each of these examinations requires the collection and 
analysis of evidence (data). 

What does reliability mean for performance assess- 
ments meeting the rigorous requirements of the 
NRS? 

In the field of assessment, the concept of reliability is 
related to the consistency of the measurement when the 
testing procedure is repeated on a population of individuals
or groups (American Educational Research Association,
American Psychological Association, and National
Council on Measurement in Education, 1999). For example,
if a learner takes a test once, then takes it again one
hour later, and maybe another hour after that, the learner 
should get about the same score each time, provided 
nothing else has changed. 
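
One common way to put a number on this kind of consistency is a test-retest correlation. The sketch below is illustrative only: the scores are invented, the function name pearson_r is an assumption made for this example, and nothing here is prescribed by the NRS or by any particular test developer. A value near 1 suggests that learners are ranked similarly on both sittings.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for the correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five learners who took the same test twice
first_sitting = [12, 18, 25, 31, 40]
second_sitting = [13, 17, 26, 30, 41]

test_retest_r = pearson_r(first_sitting, second_sitting)
print(f"test-retest correlation: {test_retest_r:.3f}")
```

A high correlation alone does not show that scores are identical across sittings, only that they move together; test developers typically report it alongside other reliability evidence.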

As the diagram indicates, a performance assessment has 
a number of potential sources for inconsistency. These 
include the assessment task itself, the administrator, the 
rater, the procedure, the conditions under which it is 
administered, or even the examinee. For example, an 
examinee might be feeling great the day of the pre-test but 
facing a family crisis on the day of the post-test. 
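
Of these sources, rater inconsistency is one that can be checked directly when two raters score the same performances. The sketch below uses Cohen's kappa, a widely used agreement statistic that corrects for chance agreement. It is an illustration under stated assumptions: the ratings are invented, the six-level scale is a stand-in for whatever rubric a program uses, and the function name cohens_kappa is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    n = len(rater_a)
    # Proportion of performances on which the raters gave the same level
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's level frequencies
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[level] * counts_b[level] for level in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical rubric levels (1 to 6) assigned to ten speaking performances
rater_a = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
rater_b = [1, 2, 3, 3, 3, 4, 5, 5, 5, 6]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa treats every disagreement alike, so a one-level difference counts the same as a five-level difference; for ordered rubric levels, a weighted variant may be more informative.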

The job of assessment developers is to demonstrate that 
reliability can be achieved even for a complex perfor- 
mance assessment. Accordingly, program staff using the 
test have a responsibility as well. They have an obligation 
to administer the assessment in the ways they have been 
trained to administer it, thus replicating the conditions 
under which reliability can be attained (American Educational
Research Association, American Psychological Association,
and National Council on Measurement in
Education, 1999). Programs need to plan for time to train
individuals to administer the test, time to administer it, and 
time to monitor its administration. This may mean an 
additional expenditure of resources and time for staff 
training so that the test will be administered appropriately 
each time it is used. Finally, before post-testing, programs 
must ensure that enough time (or hours of instruction) has 
passed for learners to show gains. 

Conclusion 

Ensuring that language tests for adult English language 
learners are appropriate, valid, and reliable is a challenge. 
Performance-based assessments are inherently complex to 
develop and implement. Yet, because the focus of assess- 
ment — both in the NRS descriptors and in the Department 
of Education's definition of content standards — is on what 
learners can do with the language, performance assess- 
ments are worth developing and validating. 

Meanwhile, as program staff choose assessments that 
meet current accountability requirements, they can take 
the following steps to ensure that valid, reliable, and 
appropriate assessments are chosen for their learners: 

♦ Review the assessment and technical information provided
by the test developer to determine that what the
assessment purports to measure reflects real-life tasks.

♦ Review the technical manual to ascertain that the test 
developers have demonstrated that reliability can be 
achieved.

♦ Provide adequate resources to train test administrators
and raters to maintain reliability of test administration 
and scoring. 

♦ Post-test only after an adequate amount of instructional
time has taken place to demonstrate level gain.

Presently, assessment of learner gains is based on the
NRS descriptors. Over the next few years, content standards
will be implemented as well. If we cannot assess
learners' performances in light of those standards in valid, 
reliable, and appropriate ways, the standards will have no 
practical value. 